Massively Parallel Processing Programming: A Hands-On Approach: CUDA Execution Model: Host vs. Device

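The CUDA execution model treats your computer as a high-performance heterogeneous system. Picture a mighty commander (the host, the CPU) and thousands of soldiers (the device, the GPU). The commander handles complex logic and decision-making, while the soldiers carry out large, repetitive tasks simultaneously.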

1. Architectural Differences

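The host is a CPU optimized for complex control flow and sequential work. The device, by contrast, is a throughput-optimized GPU: it contains thousands of simple cores and is designed to execute the same instruction across a large dataset simultaneously.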

2. The Rhythm of Execution

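A CUDA program executes in phases. Execution begins on the host, which runs the "sequential code". When the program reaches a "parallel kernel", it launches a grid of threads onto the device. Once the device has finished that bulk work, control returns to the host.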
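The phases described above can be sketched as a small CUDA program. This is a minimal illustration, not code from the lesson: the kernel `doubleElements`, the helper `runPhases`, and the choice of 256 threads per block are all assumptions made for the example. One detail worth noting is that a kernel launch is asynchronous with respect to the host, so the sketch calls `cudaDeviceSynchronize()` before the host resumes its sequential work.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Parallel phase: runs on the device, one thread per array element.
__global__ void doubleElements(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

// Hypothetical helper (not from the lesson): walks through
// host -> device -> host once and returns the number of elements
// that came back wrong, or -1 if a CUDA call failed.
int runPhases(int n) {
    if (n <= 0) return 0;

    // Sequential phase on the host: prepare the input.
    std::vector<float> h_data(n);
    for (int i = 0; i < n; ++i) h_data[i] = static_cast<float>(i);

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_data = nullptr;
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) return -1;
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);

    // Parallel phase: the host launches a grid of threads on the device.
    int threadsPerBlock = 256;  // assumed block size for this sketch
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    doubleElements<<<blocks, threadsPerBlock>>>(d_data, n);

    // The launch returns immediately; wait until the device is done.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(d_data);
        return -1;
    }

    // Control is back on the host: fetch the results and release memory.
    cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    // Sequential phase on the host: verify the results.
    int errors = 0;
    for (int i = 0; i < n; ++i) {
        if (h_data[i] != 2.0f * static_cast<float>(i)) ++errors;
    }
    return errors;
}
```

Compiled with `nvcc` and called from `main` as `runPhases(1 << 20)`, the helper is expected to return 0 on a machine with a working CUDA device.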

3. Performance Specialization

This model exploits the strengths of both processors: the CPU manages system resources and complex branching, while the GPU processes data elements in parallel using SPMD (Single-Program, Multiple-Data) logic.
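To make the SPMD idea concrete, the sketch below has every thread run the same kernel while using its own index to pick a different data element. The kernel `clampNegatives`, the wrapper `clampOnDevice`, and the block size of 128 are hypothetical names and values chosen for illustration. Error checks are omitted to keep the sketch short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Single program: every thread executes this same function.
// Multiple data: blockIdx and threadIdx give each thread its own element.
__global__ void clampNegatives(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // threads past the end of the array do nothing

    // Threads may take different paths depending on their own data.
    out[i] = (in[i] < 0.0f) ? 0.0f : in[i];
}

// Hypothetical host wrapper (error checks omitted for brevity).
std::vector<float> clampOnDevice(const std::vector<float> &input) {
    std::vector<float> result(input.size());
    if (input.empty()) return result;

    int n = static_cast<int>(input.size());
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *d_in = nullptr;
    float *d_out = nullptr;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, input.data(), bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 128;  // assumed block size for this sketch
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    clampNegatives<<<blocks, threadsPerBlock>>>(d_in, d_out, n);

    // This blocking copy runs after the kernel in the default stream.
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

For example, passing `{-1.0f, 2.0f, -3.5f, 4.0f}` to `clampOnDevice` is expected to yield `{0, 2, 0, 4}`: four threads ran identical code, and two of them took the "negative" branch because of the data they were assigned.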

QUESTION 1

Which architecture is characterized as being 'throughput-optimized'?

The Host (Intel® CPU)

The Device (NVIDIA® GPU)

The System RAM

The PCIe Bus

QUESTION 2

Part 1 of the MatrixMultiplication() example in Figure 3.6 must be completed with declarations of an Nd and a Pd pointer variable and their corresponding cudaMalloc() calls, and Part 3 must be completed with the matching deallocation calls. Which snippet follows this pattern correctly?

float *Nd, *Pd; cudaMalloc((void**)&Nd, size); ... cudaFree(Nd);

float Nd, Pd; malloc(&Nd, size); ... free(Nd);

float *Nd, *Pd; cudaMemcpy(Nd, Pd, size); ... delete Nd;

int Nd, Pd; Nd = new float[size]; ... free(Nd);

QUESTION 3

In the CUDA execution model, where does a program always begin its execution?

On the Device (GPU)

Simultaneously on both

On the Host (CPU)

In the Global Memory

QUESTION 4

What happens when the Host encounters a phase with rich data parallelism?

It speeds up its clock frequency.

It launches a Kernel onto the Device.

It stores the data in the Host Cache.

It converts the code to Python.

QUESTION 5

A student attempts to launch a 1024x1024 matrix multiplication on G80 hardware using 1024 blocks, where each thread calculates one element. Why will this fail?

The G80 cannot handle 1024 blocks.

The total number of threads exceeds 1 million.

The configuration results in 1024 threads per block, exceeding the 512 hardware limit.

Matrix multiplication is not data parallel.